Learning Objectives

By the end of this practical lab you will be able to:

SQL in R

It is most common to be running an SQL statement directly on a database, however, for this practical we will illustrate some of these basic operations in R using the sqldf package. This enables SQL statements to be run directly on data frame objects as you would with a database table. Later we will see how we can run explicitly spatial queries from within R on the Carto platform, which presents a customized version of the spatial database PostGIS. This is not meant to be tutorial on databases, which as a topic is both complex and much wider in scope than there is room to present here, however, this practical should give some basic skills in how these resources could be queried.

In the sqldf() part of this practical we utilize three datasets, the first of dataset we use here relates to 311 calls within San Francisco, the second is a table of Workplace Area Characteristics (WAC) from the LEHD Origin-Destination Employment Statistics (LODES), and finally a “crosswalk” table which provides a lookup between blocks and census tracts.

First we will read in the 311 data:

#Read data
tract_311 <- read.csv("./data/311_Tract_Coded.csv")

#The content includes the tract code and the category of the 311 call
head(tract_311)
##      GEOID10                     Category
## 1 6075980600            Abandoned Vehicle
## 2 6075980600     Graffiti Public Property
## 3 6075980600 Street and Sidewalk Cleaning
## 4 6075980600 Street and Sidewalk Cleaning
## 5 6075980600 Street and Sidewalk Cleaning
## 6 6075980600 Street and Sidewalk Cleaning

Next we will read an extract of the WAC LODES data for California. The table holds a range of variables, but of interest here are the number of workers by different types of employment. Unlike the previous data, each row (and associated ID) of relate to blocks rather than tracts.

#Read in the WAC data
WAC <- read.csv("./data/ca_wac_SI03_JT00_2014.csv")
head(WAC)
##    w_geocode C000 CA01 CA02 CA03 CE01 CE02 CE03 CNS01 CNS02 CNS03 CNS04
## 1 6.0014e+13   45   17   15   13   16    7   22     0     0     0     0
## 2 6.0014e+13   26    3   15    8    3    5   18     0     0     0     0
## 3 6.0014e+13    9    2    3    4    1    1    7     0     0     0     0
## 4 6.0014e+13   19    9    9    1    9    6    4     0     0     0     0
## 5 6.0014e+13    4    1    2    1    0    3    1     0     0     0     0
## 6 6.0014e+13   11    2    1    8    5    4    2     0     0     0     0
##   CNS05 CNS06 CNS07 CNS08 CNS09 CNS10 CNS11 CNS12 CNS13 CNS14 CNS15 CNS16
## 1     0     0     0     0     0     0     0    23     0     0     0     0
## 2     0     0     0     0    24     2     0     0     0     0     0     0
## 3     0     0     0     0     0     0     0     9     0     0     0     0
## 4     0     0     0     0     0     0     0     0     0     0     1     1
## 5     0     0     0     0     0     0     0     2     0     0     0     0
## 6     0     0     0     0     0     0     0     5     0     0     0     0
##   CNS17 CNS18 CNS19 CNS20 CR01 CR02 CR03 CR04 CR05 CR07 CT01 CT02 CD01
## 1     0    16     6     0   36    1    0    6    1    1   38    7    3
## 2     0     0     0     0   22    0    0    4    0    0   24    2    2
## 3     0     0     0     0    8    0    0    0    0    1    9    0    0
## 4     0    11     6     0   12    4    0    2    0    1   17    2    2
## 5     0     0     2     0    2    0    0    2    0    0    2    2    2
## 6     0     5     1     0    9    0    0    2    0    0    9    2    2
##   CD02 CD03 CD04 CS01 CS02 CFA01 CFA02 CFA03 CFA04 CFA05 CFS01 CFS02 CFS03
## 1    6    7   12    8   37     0     0     0     0     0     0     0     0
## 2    2   10    9   13   13     0     0     0     0     0     0     0     0
## 3    2    1    4    7    2     0     0     0     0     0     0     0     0
## 4    7    1    0    7   12     0     0     0     0     0     0     0     0
## 5    0    1    0    1    3     0     0     0     0     0     0     0     0
## 6    0    4    3    5    6     0     0     0     0     0     0     0     0
##   CFS04 CFS05 createdate
## 1     0     0   20160219
## 2     0     0   20160219
## 3     0     0   20160219
## 4     0     0   20160219
## 5     0     0   20160219
## 6     0     0   20160219
#Given the length of the block ID values R defaults to displaying these in scientific notation; however we can turn this off using the following option
options(scipen = 999)

The next csv we will read in is a “crosswalk” (lookup) that shows how one geography relates to another - in this case, blocks to tracts.

#read crosswalk (original location: https://lehd.ces.census.gov/data/lodes/LODES7/ca/ca_xwalk.csv.gz)
crosswalk <- read.csv("./data/ca_xwalk.csv")
crosswalk <- subset(crosswalk, select = c("tabblk2010","trct")) #Keep just the Block and Tract IDs
head(crosswalk)
##       tabblk2010       trct
## 1 60014001001000 6001400100
## 2 60014001001001 6001400100
## 3 60014001001002 6001400100
## 4 60014001001003 6001400100
## 5 60014001001004 6001400100
## 6 60014001001005 6001400100

Before we will begin there is a need to load a couple of required packages:

install.packages("sqldf")
# Load library
library(sqldf)

# Load library
library(rgdal)

Basic SQL Queries

The select statement returns a subset of a table and has various options:

# Select all of the rows and columns in the tract_311 table - the use of * means all; essentially, duplicates the table
tmp <- sqldf('SELECT * from tract_311')

# Select only the block ID and variable that relates to jobs in "Arts, Entertainment, and Recreation (CNS17)" from the WAC data frame
AER_311 <- sqldf('SELECT w_geocode, CNS17 FROM WAC')
head(AER_311) # Shows the head of the new data frame
##        w_geocode CNS17
## 1 60014001001007     0
## 2 60014001001008     0
## 3 60014001001017     0
## 4 60014001001024     0
## 5 60014001001026     0
## 6 60014001001027     0
# Select only the block ID and variable that relates to jobs in "Arts, Entertainment, and Recreation (CNS17)" from the WAC data frame; however sort by CNS17 and in decenting order (DESC) - removing "DESC" would return the default ascending order
AER_311 <- sqldf('SELECT w_geocode, CNS17 FROM WAC ORDER BY CNS17 DESC')
head(AER_311) # Shows the head of the new data frame
##        w_geocode CNS17
## 1 60599800001001 19424
## 2 60372074001004  6104
## 3 60730076001085  3940
## 4 60591104023001  3802
## 5 60375323033038  2717
## 6 60373109001000  2637
# You can use an AS option to rename a variable - here CNS17 is presented as AER
AER_311 <- sqldf('SELECT w_geocode, CNS17 AS AER FROM WAC')
head(AER_311)
##        w_geocode AER
## 1 60014001001007   0
## 2 60014001001008   0
## 3 60014001001017   0
## 4 60014001001024   0
## 5 60014001001026   0
## 6 60014001001027   0

New tables were created previously using the <- syntax within R; however when using SQL within a database environment you can remove (drop) or create a table based on a select query as follows:

"DROP TABLE AER_311";
"CREATE TABLE AER_311 AS SELECT w_geocode, CNS17 AS AER FROM WAC";

In addition to selecting the columns that appear in a final result we can also use the LIMIT and WHERE option to subset the returned results to a set of rows:

#Returns all columns but only the first 10 rows
sqldf('SELECT * from tract_311 LIMIT 10')
##       GEOID10                     Category
## 1  6075980600            Abandoned Vehicle
## 2  6075980600     Graffiti Public Property
## 3  6075980600 Street and Sidewalk Cleaning
## 4  6075980600 Street and Sidewalk Cleaning
## 5  6075980600 Street and Sidewalk Cleaning
## 6  6075980600 Street and Sidewalk Cleaning
## 7  6075980600 Street and Sidewalk Cleaning
## 8  6075980600 Street and Sidewalk Cleaning
## 9  6075980600 Street and Sidewalk Cleaning
## 10 6075980600                MUNI Feedback
#Return all records where the Category is "Noise Report"
noise_311 <- sqldf('SELECT * from tract_311 WHERE Category = "Noise Report"')

Another useful feature of SQL is the ability to join tables. There are four main types of join statements which use the terms left and right to infer the first or second presented table within an SQL query:

Join Type Description
Inner Join Only those rows in both tables which match are returned
Left Join All rows in the left table, plus matching rows in the right table
Right Join All rows in the right table, plus matching rows in the left table
Outer Join All rows in both tables, matching and non matching are returned - essentially combining a left and right join

An example of a left join is shown below - this uses “AS A” AND “AS B” after selecting the table; these are call alias and can be used in replacement for the table names. The resulting table now has the “trct” column appended.

# Left Join
AER_311_Tract <- sqldf("SELECT * from AER_311 AS A LEFT JOIN crosswalk AS B on A.w_geocode = B.tabblk2010") 
#Show the top six rows of the table
head(AER_311_Tract)
##        w_geocode AER     tabblk2010       trct
## 1 60014001001007   0 60014001001007 6001400100
## 2 60014001001008   0 60014001001008 6001400100
## 3 60014001001017   0 60014001001017 6001400100
## 4 60014001001024   0 60014001001024 6001400100
## 5 60014001001026   0 60014001001026 6001400100
## 6 60014001001027   0 60014001001027 6001400100

Aggregating data in SQL

We can now adapt this query so in addition to the join, we can also aggregate the data by Census Tract - summing the instance of the AER variable. This requires the use of a “GROUP BY” statement, then an aggregate function - in this case sum(). In the select part of the statement we also now only call two columns - the AER counts from the AER_311 table, then the ID from the crosswalk table. You will see that the sum is wrapped around “A.AER” as there will be multiple rows of data for each census tract.

# Left Join and group by
AER_311_Tract <- sqldf("SELECT sum(A.AER) AS AER, B.trct from AER_311 AS A LEFT JOIN crosswalk AS B on A.w_geocode = B.tabblk2010 GROUP BY trct") 
#Show the top six rows of the table
head(AER_311_Tract)
##   AER       trct
## 1   0 6001400100
## 2   1 6001400200
## 3   8 6001400300
## 4   2 6001400400
## 5   0 6001400500
## 6   2 6001400600

We now have created a tract level count of the number of workers in “Arts, Entertainment, and Recreation” (AER) jobs. We will now aggregate the “noise_311” table that we created earlier using the count() function. Each row in this table is a separate incidence, however the records also have the tract that they occurred in appended.

#Group by with count
noise_311_tract <- sqldf("SELECT GEOID10, count(Category) as Noise_C FROM noise_311 GROUP BY GEOID10")

We can now merge the number of workers in “Arts, Entertainment, and Recreation” and those recorded noise complaints.

# Merge the two tables
noise_AER_tract <- merge(AER_311_Tract,noise_311_tract, by.x="trct",by.y = "GEOID10")

Spatial Queries

Spatially enabled databases allow queries to be run that are explicitly spatial; for example, what is the shortest path between two locations or which points lie within a polygon. Such queries are equivalent to those you might run in a standard GIS. However, typically, the volume of data that can be queried efficiently is usually larger within a database. There are many different spatial databases, however one of the most popular open source database is PostGIS. Although it is possible to download and set this up locally and use with R (see the postGIStools package), we will adopt a different approach here.

The company Carto provide a web front end to postGIS and through the package “cartodb-r” the functionality of postGIS can be utilized within R. The other advantage of this approach is that it is very scalable to large data as the Carto system exists within the cloud.

Carto Signup and Initialisation Process

You will first need to create a Carto account by visiting: carto.com/signup/.

At this point it might be worth having a look at the Carto guides which give a practical overview of the service.

We will now import our data into Carto - drag the “Brooklyn_Trees.geojson” file onto the dashboard window to import the data. This is an extract of a full tree inventory for NYC. Once the upload has completed you will be redirected to the Carto Builder interface - this can then be used to create a web map and dashboard interface.

You will see on the left hand side that the dataset has been added as a layer called “brooklyn_trees”.

However, we will now add a further dataset “Brooklyn_Neigh_Tab.geojson” which delineates neighborhoods in Brooklyn. To upload this file, click the blue “Add” button from the left hand side, then select “Connect Dataset” from the add a new layer window that opens. Then you can browse and add this additional layer. This will then take you back to the map view, upload and then display the file. The layer is called “brooklyn_neigh_tab” and can be seen on the left hand side. Repeat this process for the “subway.geojson” 1 and “subway_stations.geojson”2.

You can look at the imported data on the Carto interface:

We will now install two required packages and a fork of the CartoDB package which has had a number of bug fixes.

#Download required packages
install.packages("RCurl")
install.packages("RJSONIO")
install.packages("devtools")

# Devtools enables packages to be loaded from github
library(devtools)

# Install CatoDB package
install_github("qszhao/cartodb-r", subdir = "CartoDB")
# Load package
library(CartoDB)

Although optional, if you can acquire an API key, this also enables R to write to Carto. Without an API, the connection can be established, but you will only be able to read data. To obtain a free CARTO student account with API access, please visit this page to apply the carto student account

You can connect to Carto using the cartodb() statement which requires your account name, and also an optional API key which enables you to write results to Carto.

# Connects R to Carto
#cartodb("urbananalytics", api.key="optional-key-for-writes") #optional API key
# Connect to your Carto instance - change the name in the "" to your account name
cartodb("urbananalytics")
cartodb.test()

Returning Tabular and Spatial Objects

Now that we have connected to Carto, we can pull data down into R using the cartodb.collection() function. We will first extract ten rows of the “brooklyn_trees” table (as the total size of this table is very large). We will

# 
#Download the top ten rows of the table "brooklyn_trees"
tmp <- cartodb.collection("brooklyn_trees", geomAs="XY", limit=10)
tmp
##    cartodb_id     tree_speci   serial the_geom_x the_geom_y
## 1          45         Ginkgo 199,896B   -73.9833    40.6892
## 2         577 Other Magnolia 202,524B   -73.9708    40.6069
## 3         758        Pin Oak 224,721B   -73.9837    40.6696
## 4        2282        Pin Oak 207,753B   -73.9706    40.6388
## 5        4389 Kwanzan Cherry     NULL   -74.0097     40.606
## 6        5769  Silver Linden     NULL   -73.9679    40.5779
## 7        7421  Silver Linden     4673   -73.9857    40.6146
## 8        7603         Ginkgo    38923   -73.9167    40.6624
## 9        8517    Scarlet Oak    26054   -73.8748    40.6767
## 10      13935   Pond Cypress    63962   -73.9453    40.6928

Data stored within a spatial database typically has an additional column which in the case of Carto (and PostGIS) is named by default as “the_geom”. This stores the spatial information that is applicable to each row of the database. In the case of the table “brooklyn_trees”, this is the latitude and longitude coordinates of each tree record; and in “brooklyn_neigh_tab”, is a set of linestrings delineating the polygons bounding each of the neighborhood zones. The addition of the geomAS=“XY” converted these to latitude and longitude in the results of the returned query.

It is also possible to use SQL directly on the Carto table:

tmp
##    cartodb_id     tree_speci   serial the_geom_x the_geom_y
## 1          45         Ginkgo 199,896B   -73.9833    40.6892
## 2         577 Other Magnolia 202,524B   -73.9708    40.6069
## 3         758        Pin Oak 224,721B   -73.9837    40.6696
## 4        2282        Pin Oak 207,753B   -73.9706    40.6388
## 5        4389 Kwanzan Cherry     NULL   -74.0097     40.606
## 6        5769  Silver Linden     NULL   -73.9679    40.5779
## 7        7421  Silver Linden     4673   -73.9857    40.6146
## 8        7603         Ginkgo    38923   -73.9167    40.6624
## 9        8517    Scarlet Oak    26054   -73.8748    40.6767
## 10      13935   Pond Cypress    63962   -73.9453    40.6928
#Same query, but supplied as a formal SQL query
tmp <- cartodb.collection(sql="SELECT cartodb_id, the_geom, tree_speci, serial FROM brooklyn_trees LIMIT 10")
tmp
##    cartodb_id                                           the_geom
## 1          45 0101000020E6100000F2070060EE7E52C0BD0A00C037584440
## 2         577 0101000020E61000000EF8FF9F217E52C0F50D00E0AE4D4440
## 3         758 0101000020E610000027050000F57E52C0AEEFFF7FB5554440
## 4        2282 0101000020E6100000710600401E7E52C022FCFF3FC4514440
## 5        4389 0101000020E610000079F2FFDF9E8052C00CF5FF5F914D4440
## 6        5769 0101000020E610000011FEFF1FF27D52C0421500A0F8494440
## 7        7421 0101000020E6100000B0F2FFBF157F52C057190040AB4E4440
## 8        7603 0101000020E6100000DAFDFF3FAB7A52C04C040080C9544440
## 9        8517 0101000020E610000067F4FFBFFC7752C074E9FF1F9E564440
## 10      13935 0101000020E6100000FEFCFFBF7F7C52C0EDFEFF9FAD584440
##        tree_speci   serial
## 1          Ginkgo 199,896B
## 2  Other Magnolia 202,524B
## 3         Pin Oak 224,721B
## 4         Pin Oak 207,753B
## 5  Kwanzan Cherry     NULL
## 6   Silver Linden     NULL
## 7   Silver Linden     4673
## 8          Ginkgo    38923
## 9     Scarlet Oak    26054
## 10   Pond Cypress    63962

You will have seen that the_geom was returned directly, however, we can also use PostGIS functions to convert this into different formats - ST_AsText() as a Well Known Text format; or ST_X() and ST_Y() to return the longitude and latitude. You will see that AS is also used again to give the new columns a name.

tmp
##    cartodb_id                                           the_geom
## 1          45 0101000020E6100000F2070060EE7E52C0BD0A00C037584440
## 2         577 0101000020E61000000EF8FF9F217E52C0F50D00E0AE4D4440
## 3         758 0101000020E610000027050000F57E52C0AEEFFF7FB5554440
## 4        2282 0101000020E6100000710600401E7E52C022FCFF3FC4514440
## 5        4389 0101000020E610000079F2FFDF9E8052C00CF5FF5F914D4440
## 6        5769 0101000020E610000011FEFF1FF27D52C0421500A0F8494440
## 7        7421 0101000020E6100000B0F2FFBF157F52C057190040AB4E4440
## 8        7603 0101000020E6100000DAFDFF3FAB7A52C04C040080C9544440
## 9        8517 0101000020E610000067F4FFBFFC7752C074E9FF1F9E564440
## 10      13935 0101000020E6100000FEFCFFBF7F7C52C0EDFEFF9FAD584440
##        tree_speci   serial
## 1          Ginkgo 199,896B
## 2  Other Magnolia 202,524B
## 3         Pin Oak 224,721B
## 4         Pin Oak 207,753B
## 5  Kwanzan Cherry     NULL
## 6   Silver Linden     NULL
## 7   Silver Linden     4673
## 8          Ginkgo    38923
## 9     Scarlet Oak    26054
## 10   Pond Cypress    63962
#Same query, but supplied as a formal SQL query
tmp <- cartodb.collection(sql="SELECT cartodb_id,  ST_AsText(the_geom), ST_X(the_geom) AS Lon, ST_Y(the_geom) AS Lat, tree_speci, serial FROM brooklyn_trees LIMIT 10")
tmp
##    cartodb_id                           st_astext      lon     lat
## 1          45  POINT(-73.9832992554 40.689201355) -73.9833 40.6892
## 2         577 POINT(-73.9708023071 40.6068992615) -73.9708 40.6069
## 3         758 POINT(-73.9837036133 40.6696014404) -73.9837 40.6696
## 4        2282 POINT(-73.9705963135 40.6388015747) -73.9706 40.6388
## 5        4389 POINT(-74.0096969604 40.6059989929) -74.0097  40.606
## 6        5769 POINT(-73.9679031372 40.5778999329) -73.9679 40.5779
## 7        7421 POINT(-73.9857025146 40.6146011353) -73.9857 40.6146
## 8        7603  POINT(-73.9167022705 40.662399292) -73.9167 40.6624
## 9        8517  POINT(-73.8748016357 40.676700592) -73.8748 40.6767
## 10      13935 POINT(-73.9452972412 40.6927986145) -73.9453 40.6928
##        tree_speci   serial
## 1          Ginkgo 199,896B
## 2  Other Magnolia 202,524B
## 3         Pin Oak 224,721B
## 4         Pin Oak 207,753B
## 5  Kwanzan Cherry     NULL
## 6   Silver Linden     NULL
## 7   Silver Linden     4673
## 8          Ginkgo    38923
## 9     Scarlet Oak    26054
## 10   Pond Cypress    63962

Although it is useful to return attributes of the data stored on Carto as tables, it is also possible to return these as spatial data by adding a ‘method=“GeoJSON”’ option. If you are using this with an SQL statement as above, an error will be returned unless you also have “the_geom” as part of the SELECT statement (not wrapped in a function - e.g. ST_AsText).

This example creates a new object called “zones” which contains GeoJSON boundaries that were read from the “brooklyn_neigh_tab” table. Note that the function also specified which attributes from the table to include.

#Get the GeoJSON
zones <- cartodb.collection("brooklyn_neigh_tab",columns=c("cartodb_id","the_geom"),method="GeoJSON")
#Create a SpatialPolygonsDataFrame obect
zones.SP <-readOGR(zones,layer = 'OGRGeoJSON',verbose =FALSE) #The "verbose = FALSE" stops readOGR printing the content of the geoJSON to the R console
#Plot the zones
plot(zones.SP)

Basic Spatial Analysis using Carto

As discussed earlier, PostGIS provides the underlying technology for Carto and there is a huge array of spatial analysis functionality. These cover a wide array of analysis and data types, with their specification detailed fully in the PostGIS documentation. We will however illustrate a number of the common features here.

Area Calculation

# Return the areas of polygons
test<- cartodb.collection(sql="SELECT cartodb_id, ntaname, ST_Area(the_geom::geography) / 1000000 AS Area_KM FROM brooklyn_neigh_tab")

In the above query we have added “::geography” after “the_geom” to return the area calculation in meters, which we then divide by 1000000 (one square KM - 1000 * 1000) to give the result in kilometers (KM).

Centroid of an Object

#Return the centroid of the polygons
test <- cartodb.collection(sql="SELECT cartodb_id, ntaname, ST_Centroid(the_geom) as the_geom FROM brooklyn_neigh_tab",method="GeoJSON")

# Plot the results
plot(readOGR(test,layer = 'OGRGeoJSON',verbose =FALSE))

Here we used the ST_Centroid() function to calculate the centroid locations from the zones, and returned this as “the_geom”. As demonstrated before, the results can be returned as GeoJSON and then read into R with readOGR().

Buffers

#Return buffers
buffers <- cartodb.collection(sql="SELECT cartodb_id,  ST_Buffer(the_geom::geography,200) as the_geom FROM brooklyn_trees WHERE tree_speci = 'Green Ash'",method="GeoJSON")

# Plot the results
buffers.SP <- readOGR(buffers,layer = 'OGRGeoJSON',verbose =FALSE)
#Plot first 100 tree buffers
plot(buffers.SP[1:100,])

This SQL uses the ST_Buffer() function to return the buffers (as polygons). The “::geography” option is used again, and as such, the number specified within the function is in meters. The buffer set is therefore 200 meters, and the results are limited to just those trees with a species type of “Green Ash”.

Intersections

Two tables not yet used include the location of subway lines and stations in NYC. These extend beyond the boundary of Brooklyn so we will demonstrate here how these data can be limited to these extents within Carto. First we will use the ST_Intersects() function to extract the stations within Brooklyn:

# Return intersection of stations with Brooklyn neighborhoods
intersection_stations <- cartodb.collection(sql="SELECT a.*,b.ntaname FROM subway_stations AS a, brooklyn_neigh_tab AS b WHERE ST_Intersects(a.the_geom, b.the_geom)",method="GeoJSON") # b.ntaname is added to the select statement to append the neighbouhood names that the stations are located within

# Plot the results
intersection_stations.SP <- readOGR(intersection_stations,layer = 'OGRGeoJSON',verbose =FALSE)
plot(intersection_stations.SP)

#We can also see how each of the stations now have the neighborhood appended
head(intersection_stations.SP@data)
##   cartodb_id                                        url  line
## 1        329 http://www.mta.info/nyct/subway/index.html   B-Q
## 2        346 http://www.mta.info/nyct/subway/index.html   B-Q
## 3        388 http://www.mta.info/nyct/subway/index.html   N-R
## 4        389 http://www.mta.info/nyct/subway/index.html   N-R
## 5        392 http://www.mta.info/nyct/subway/index.html   N-R
## 6        390 http://www.mta.info/nyct/subway/index.html D-N-R
##             name                                        ntaname
## 1 Sheepshead Bay Sheepshead Bay-Gerritsen Beach-Manhattan Beach
## 2        Neck Rd Sheepshead Bay-Gerritsen Beach-Manhattan Beach
## 3        59th St                               Sunset Park West
## 4        45th St                               Sunset Park West
## 5        53rd St                               Sunset Park West
## 6        36th St                               Sunset Park West

In the above SQL, you will see the use of “AS a” and “AS b” in relation to the tables - these are alias which means you don’t have to refer to the full table in the function “ST_Intersects(a.the_geom, b.the_geom)”. This could be written fully as “ST_Intersects(subway_stations.the_geom, subway_stations.the_geom)”.

We will now also write the results of this query back to Carto using “CREATE TABLE subway_stations_brooklyn AS”, however this will only work if the API key was set earlier:

# Write the query to Carto
cartodb.collection(sql="CREATE TABLE subway_stations_brooklyn AS SELECT a.* FROM subway_stations AS a, brooklyn_neigh_tab AS b WHERE ST_Intersects(a.the_geom, b.the_geom)")
## Warning in cartodb.df(sql): relation "subway_stations_brooklyn" already
## exists
## NULL
# In order to make the table visible on the datasets page, this statement is also required
cartodb.collection(sql="select cdb_cartodbfytable('subway_stations_brooklyn')")
## Warning in cartodb.df(sql): Please set user quota before cartodbfying
## tables.
## NULL

Here is another example of using ST_Intersects() for the line geometry:

# Return intersection of stations with Brooklyn neighborhoods
intersection_lines <- cartodb.collection(sql="SELECT a.* FROM subway a, brooklyn_neigh_tab b WHERE ST_Intersects(a.the_geom, b.the_geom)",method="GeoJSON")

# Plot the results
intersection_lines.SP <- readOGR(intersection_lines,layer = 'OGRGeoJSON',verbose =FALSE)
plot(intersection_lines.SP)

You can see how the extents have been cut out of the full network as follows - the selected lines are in red:

Calculating distances between objects

Earlier we created a new table containing the geometry for the subway stations in Brooklyn. We will now use the ST_Distance() function to calculate the distance from a latitude and longitude location (Carto offices in Brooklyn) to the subway stations.

Here we can calculate the distance to the subway entrances from a new geometry that we create using ST_Point(), which is assigned the projection WGS84 (i.e. SRID 4326) using the ST_SetSRID() function.

distance_station <- cartodb.collection(sql="SELECT name, ST_Distance(the_geom::geography, ST_SetSRID(ST_Point(-73.9365832, 40.7042924),4326)::geography) as distance FROM subway_stations_brooklyn ORDER BY distance")

#Show the closest six subway stations
head(distance_station)
##           name distance
## 1   Morgan Ave 356.3292
## 2 Montrose Ave 438.1813
## 3 Flushing Ave 592.0638
## 4   Myrtle Ave 792.2829
## 5     Grand St 873.8736
## 6   Lorimer St 911.6874

Further resources / training


  1. Sourced from: https://wfs.gc.cuny.edu/SRomalewski/MTA_GISdata/June2010_update/nyctsubwayroutes_100627.zip

  2. Sourced from: https://nycopendata.socrata.com/Transportation/Subway-Stations/arq3-7z49